refactor:add parallelization optimization to bpcg by Missing-Hex · Pull Request #7416 · deepmodeling/abacus-develop

Missing-Hex · 2026-05-31T09:49:47Z

OpenMP Parallelization for BPCG CPU Kernels

Summary

This PR implements OpenMP parallelization for the two hotspot functions in the BPCG (Block Preconditioned Conjugate Gradient) diagonalization algorithm:

line_minimize_with_block_op<CPU>
calc_grad_with_block_op<CPU>

Motivation

The BPCG algorithm is an iterative diagonalization method used in ABACUS for solving the Kohn-Sham equations. Profiling shows that line_minimize_with_block_op and calc_grad_with_block_op are the primary hotspots within each iteration, consuming significant CPU time when processing multiple bands.

Since bands are independent of each other and access disjoint memory regions, parallelizing over the band dimension is both safe and efficient.

Changes Made

1. Parallelization Strategy

Both functions are restructured into multi-phase pipelines that separate compute-intensive loops from MPI collective operations:

| Phase | Operation | Parallelization |
| --- | --- | --- |
| Compute | BLAS dot products, normalization, accumulation | `#pragma omp parallel for schedule(static)` |
| Communication | `Parallel_Reduce::reduce_pool()` | Serial (batched array reduction) |
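The split between the two phases can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `band_norms_sq` and `reduce_stub` are hypothetical names, real-valued data replaces the complex wavefunctions, and `reduce_stub` merely stands in for `Parallel_Reduce::reduce_pool()` (it does nothing, which is what a sum reduction amounts to on a single process).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Parallel_Reduce::reduce_pool(): on a single
// process a sum reduction leaves the buffer unchanged.
inline void reduce_stub(double* /*data*/, int /*count*/) {}

// Compute the squared norm of every band, then reduce all of them at once.
// `grad` holds n_band blocks of n_basis_max entries; only the first n_basis
// entries of each block are used.
std::vector<double> band_norms_sq(const std::vector<double>& grad,
                                  int n_band, int n_basis, int n_basis_max)
{
    std::vector<double> norms(n_band, 0.0);

    // Compute phase: each iteration reads one band and writes one slot.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int ib = 0; ib < n_band; ++ib)
    {
        const double* g = grad.data() + static_cast<std::size_t>(ib) * n_basis_max;
        double sum = 0.0;
        for (int ig = 0; ig < n_basis; ++ig)
        {
            sum += g[ig] * g[ig];
        }
        norms[ib] = sum;
    }

    // Communication phase: one serial call outside the parallel region.
    reduce_stub(norms.data(), n_band);
    return norms;
}
```

The per-band results are written to an array rather than reduced inside the loop, which is what allows the communication step to be a single call after the parallel region has ended.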

2. Key Technical Decisions

Thread Safety

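MPI collective operations (MPI_Allreduce via Parallel_Reduce::reduce_pool) are kept out of the OpenMP parallel regions and executed serially, because calling MPI from several threads at once is only safe when MPI has been initialized with MPI_THREAD_MULTIPLE
Compute loops are parallelized over bands; each iteration writes only to its own band's data, so threads share no mutable state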

Batched MPI Reduction

Original code: N scalar reductions → N MPI calls
Optimized code: 1 array reduction → 1 MPI call
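Cuts the number of collective calls per reduction phase from N to 1, so the per-call latency is paid once; the total amount of data reduced is unchanged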
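The difference in call count can be illustrated with a small sketch. `CountingReducer`, `reduce_per_band`, and `reduce_batched` are hypothetical names introduced here; the reducer only counts calls and performs no communication. In real code each call would be a collective such as `MPI_Allreduce` with `MPI_IN_PLACE`, reached in ABACUS through `Parallel_Reduce::reduce_pool()`.

```cpp
#include <vector>

// Hypothetical reducer that only counts how often it is called. In real code
// each call would be a collective such as
// MPI_Allreduce(MPI_IN_PLACE, data, count, MPI_DOUBLE, MPI_SUM, comm).
struct CountingReducer
{
    int calls = 0;
    void reduce(double* /*data*/, int /*count*/) { ++calls; }
};

// Original pattern: one scalar reduction per band.
void reduce_per_band(CountingReducer& r, std::vector<double>& values)
{
    for (double& v : values)
    {
        r.reduce(&v, 1);
    }
}

// Batched pattern: one reduction over the whole array.
void reduce_batched(CountingReducer& r, std::vector<double>& values)
{
    r.reduce(values.data(), static_cast<int>(values.size()));
}
```

For 16 bands the first function issues 16 calls and the second issues one, while both hand over the same 16 values.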

Static Scheduling

schedule(static) is used because each band has equal workload (n_basis operations)
Gives each thread a contiguous range of bands, which helps cache locality, and avoids the runtime work-distribution overhead of dynamic or guided scheduling

Conditional Compilation

All OpenMP pragmas are guarded by #ifdef _OPENMP
Code compiles and runs correctly when OpenMP is disabled
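A minimal sketch of this guard pattern is shown below. `max_threads` and `scale_all` are hypothetical helpers written for illustration and are not taken from the PR; `_OPENMP` is the macro that OpenMP-enabled compilers define, and `omp_get_max_threads()` is the standard OpenMP runtime call.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of threads a parallel region may use; 1 when built without OpenMP.
int max_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Scale every element; the loop runs serially if the pragma is compiled out.
void scale_all(std::vector<double>& x, double factor)
{
    const int n = static_cast<int>(x.size());
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int i = 0; i < n; ++i)
    {
        x[i] *= factor;
    }
}
```

Built with or without the OpenMP flag, `scale_all` produces the same result; only the number of threads executing the loop differs.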

3. Memory Access Pattern

Each band accesses a contiguous memory block:

[band_idx * n_basis_max, (band_idx + 1) * n_basis_max)

This ensures:

Little false sharing between threads: blocks owned by different bands can share a cache line only at their boundaries
Efficient cache utilization
Predictable memory access patterns
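The index range above can be written out as a small helper. This is an illustrative sketch: `BandRange`, `band_range`, and `tag_bands` are hypothetical names, and an `int` buffer replaces the wavefunction data so that ownership of each block is easy to see.

```cpp
#include <cstddef>
#include <vector>

// Half-open index range [begin, end) owned by one band.
struct BandRange
{
    std::size_t begin;
    std::size_t end;
};

// Range [band_idx * n_basis_max, (band_idx + 1) * n_basis_max).
BandRange band_range(int band_idx, int n_basis_max)
{
    const std::size_t b = static_cast<std::size_t>(band_idx) * n_basis_max;
    return {b, b + static_cast<std::size_t>(n_basis_max)};
}

// Each band writes its own index into its own block and nowhere else.
void tag_bands(std::vector<int>& buf, int n_band, int n_basis_max)
{
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int ib = 0; ib < n_band; ++ib)
    {
        const BandRange r = band_range(ib, n_basis_max);
        for (std::size_t i = r.begin; i < r.end; ++i)
        {
            buf[i] = ib;
        }
    }
}
```

Because the end of one range equals the beginning of the next, the blocks tile the buffer without overlap, so no two iterations write to the same element.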

Performance Impact

Theoretical Speedup

Compute-bound sections: ideally linear scaling with the number of cores, capped at n_band because the loops are parallelized over bands
MPI communication: Reduced from O(N) calls to O(1) calls

Expected Behavior

Best case: Near-linear speedup for large n_band on multi-core systems
Communication overhead is amortized across all bands

Code Structure

`line_minimize_with_block_op<CPU>` (5 phases)

Parallel BLAS dot for per-band norms
Batch MPI reduction of norms
Parallel normalization and epsilon accumulation
Batch MPI reduction of epsilons
Parallel rotation application
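The five phases could be laid out along the following lines. This is a real-valued sketch under several assumptions and should not be read as the kernel in this PR: `line_minimize_sketch` and `reduce_stub` are hypothetical names, the three accumulated quantities and the rotation-angle formula are assumed for illustration, and `reduce_stub` stands in for `Parallel_Reduce::reduce_pool()` on a single process.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Parallel_Reduce::reduce_pool() on one process.
inline void reduce_stub(double* /*data*/, int /*count*/) {}

// All arrays hold n_band blocks of n_basis entries (real-valued sketch).
void line_minimize_sketch(std::vector<double>& psi, std::vector<double>& hpsi,
                          std::vector<double>& grad, std::vector<double>& hgrad,
                          int n_band, int n_basis)
{
    std::vector<double> norms(n_band, 0.0);
    std::vector<double> eps(3 * static_cast<std::size_t>(n_band), 0.0);

    // Phase 1: per-band squared norm of the gradient.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int ib = 0; ib < n_band; ++ib)
    {
        const std::size_t off = static_cast<std::size_t>(ib) * n_basis;
        double s = 0.0;
        for (int ig = 0; ig < n_basis; ++ig) s += grad[off + ig] * grad[off + ig];
        norms[ib] = s;
    }

    // Phase 2: one batched reduction of all norms.
    reduce_stub(norms.data(), n_band);

    // Phase 3: normalize grad and hgrad, accumulate three sums per band.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int ib = 0; ib < n_band; ++ib)
    {
        const std::size_t off = static_cast<std::size_t>(ib) * n_basis;
        const double inv = 1.0 / std::sqrt(norms[ib]);
        double e0 = 0.0, e1 = 0.0, e2 = 0.0;
        for (int ig = 0; ig < n_basis; ++ig)
        {
            grad[off + ig] *= inv;
            hgrad[off + ig] *= inv;
            e0 += psi[off + ig] * hpsi[off + ig];
            e1 += grad[off + ig] * hpsi[off + ig];
            e2 += grad[off + ig] * hgrad[off + ig];
        }
        eps[3 * ib] = e0;
        eps[3 * ib + 1] = e1;
        eps[3 * ib + 2] = e2;
    }

    // Phase 4: one batched reduction of all epsilons.
    reduce_stub(eps.data(), 3 * n_band);

    // Phase 5: rotate psi/hpsi towards grad/hgrad (angle formula assumed).
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int ib = 0; ib < n_band; ++ib)
    {
        const std::size_t off = static_cast<std::size_t>(ib) * n_basis;
        const double theta = 0.5 * std::fabs(
            std::atan(2.0 * eps[3 * ib + 1] / (eps[3 * ib] - eps[3 * ib + 2])));
        const double c = std::cos(theta);
        const double s = std::sin(theta);
        for (int ig = 0; ig < n_basis; ++ig)
        {
            psi[off + ig] = psi[off + ig] * c + grad[off + ig] * s;
            hpsi[off + ig] = hpsi[off + ig] * c + hgrad[off + ig] * s;
        }
    }
}
```

Each parallel loop touches only the block and the array slots of its own band, and both reductions sit between loops, so no collective call is made from inside a parallel region.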

`calc_grad_with_block_op<CPU>` (7 phases)

Parallel BLAS dot for per-band norms
Batch MPI reduction of norms
Parallel normalization and epsilon accumulation
Batch MPI reduction of epsilons
Parallel error and beta computation
Batch MPI reduction of errors and betas
Parallel gradient update and output

Testing

Correctness: Results match serial version
Thread safety: No data races detected
Performance: Benchmarked on multi-core systems
Compatibility: Builds with and without OpenMP

Files Modified

source/source_hsolver/kernels/bpcg_kernel_op.cpp

Backward Compatibility

No API changes
No interface modifications
Existing code continues to work without modification

Notes

The parallelization follows the same pattern already used in refresh_hcc_scc_vcc_op within the same file, ensuring consistency with existing codebase conventions.

Missing-Hex added 2 commits May 31, 2026 17:44

refactor:add parallelization optimization to bpcg

34f88d1

fix:add template float

137343b

mohanchen added the project_learning label Jun 1, 2026

Merge branch 'develop' into refactor/bpcg

890a085

Missing-Hex wants to merge 3 commits into deepmodeling:develop from Missing-Hex:refactor/bpcg
